[Tutorial] Building a k3s Cluster from Scratch

Tip: This tutorial is written based on Linux + fish shell. Some command syntax may differ slightly from Bash. Please adjust commands according to your system environment.
Introduction
k3s is a lightweight Kubernetes distribution developed by Rancher Labs and is currently one of the most popular lightweight K8s solutions.
Compared with traditional operations approaches (1Panel, Baota, plain SSH, etc.), k3s has a steeper learning curve and requires you to understand more container orchestration concepts. Once mastered, however, you will gain:
Core Advantages of k3s
- Lightweight and Efficient - Single binary file, memory usage < 512MB, perfect for low-spec VPS
- Production Ready - Fully compatible with Kubernetes API, smooth migration to standard K8s
- Declarative Operations - Describe desired state with YAML, system automatically maintains it
- High Availability Guarantee - Automatic fault recovery + multi-node load balancing
- Out of the Box - Built-in networking, storage, Ingress and other core components
Through k3s, we can consolidate multiple cheap VPS into an enterprise-grade highly available cluster, achieving automation levels difficult to reach with traditional operations.
Target Audience and Preparation
Suitable For
- Developers with some Linux foundation
- Those wanting to transition from traditional operations to container orchestration
- Tech enthusiasts wanting to build personal high-availability services
Prerequisites
- Familiar with Linux command-line operations
- Understanding of Docker container basics
- Basic networking knowledge (SSH, firewall)
Learning Outcomes
After completing this tutorial, you will master:
- Using k3sup to quickly deploy k3s clusters
- Understanding the role of k3s core components (API Server, etcd, kubelet, etc.)
- Replacing default components to optimize performance (Cilium CNI, Nginx Ingress, etc.)
- Deploying your first application and configuring external access
- Basic cluster operations and troubleshooting techniques
Deployment Planning
k3s comes with a set of streamlined components by default. To meet production-level requirements, we need to plan in advance which modules to keep or replace. The following table shows the recommended trade-off strategy:
| Component Type | k3s Default | Replacement | Reason | k3sup Disable Parameter |
|---|---|---|---|---|
| Container Runtime | containerd | - | Keep default | - |
| Data Storage | SQLite / etcd | - | SQLite for single node, etcd for cluster | - |
| Ingress Controller | Traefik | Nginx Ingress / Others | Team familiarity, different feature requirements | --disable traefik |
| LoadBalancer | Service LB (Klipper-lb) | Not deployed | External traffic is fronted by Cloudflare, so an in-cluster load balancer is unnecessary | --disable servicelb |
| DNS | CoreDNS | - | Keep default | - |
| Storage Class | Local-path-provisioner | Longhorn | Distributed storage, high availability, backup capability | --disable local-storage |
| CNI | Flannel | Cilium | eBPF performance, network policies, observability | --flannel-backend=none --disable-network-policy |

Environment Preparation
Required Tools
First, ensure that k3sup, kubectl, Helm and other tools are correctly installed (refer to the installation instructions provided by their respective GitHub projects):
k3sup version
kubectl version
helm version
Server Requirements
Prepare at least three cloud servers (example environment uses Ubuntu 24.04) as a minimal three-node high-availability control plane (recommended configuration ≥ 4C4G). Record each node's IP in advance, confirm SSH public keys are distributed, and know the local private key path, which will be used in subsequent steps.
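Before handing the nodes over to k3sup, it is worth confirming that key-based SSH login works for each of them. A quick check might look like this (the IPs and key path are placeholders for your own values):
# bash: loop over the planned node IPs and verify SSH access plus the OS release
for ip in <NODE1_IP> <NODE2_IP> <NODE3_IP>; do
    ssh -i <KEY_PATH> root@$ip 'hostname && lsb_release -ds'
done
fish shell (for reference only):
for ip in <NODE1_IP> <NODE2_IP> <NODE3_IP>; ssh -i <KEY_PATH> root@$ip 'hostname && lsb_release -ds'; end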
Deploy Initial Control Plane
Use k3sup to deploy the initial control node:
k3sup install \
--ip <INITIAL_NODE_IP> \
--user root \
--ssh-key <KEY_PATH> \
--k3s-channel latest \
--cluster \
--k3s-extra-args "--disable traefik --disable servicelb --disable local-storage --flannel-backend=none --disable-network-policy"
After installation, k3sup will automatically copy kubeconfig to the current directory. While you can use this configuration directly, a more robust approach is to merge it with existing configuration files.
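Before merging, you can point kubectl directly at the freshly written file to confirm the API server is reachable (run this from the directory where k3sup wrote the kubeconfig; the node will likely still show NotReady because no CNI is installed yet):
kubectl --kubeconfig ./kubeconfig get nodes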
Merge kubeconfig
- Back up your current kubeconfig (located at ~/.kube/config by default):
cp ~/.kube/config ~/.kube/config.backup
fish shell (for reference only):
cp $KUBECONFIG {$KUBECONFIG}.backup
- Merge old and new kubeconfig into a single flat file:
KUBECONFIG=~/.kube/config:./kubeconfig kubectl config view --flatten > ~/.kube/config.new
fish shell (for reference only):
KUBECONFIG=$KUBECONFIG:./kubeconfig kubectl config view --flatten > kubeconfig-merged.yaml
- After verifying the new file content is correct, overwrite the old configuration:
mv ~/.kube/config.new ~/.kube/config
fish shell:
mv ./kubeconfig-merged.yaml $KUBECONFIG
- Verify that the new context is active:
kubectl config get-contexts
kubectl config use-context default
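Optionally, rename the imported context to something more descriptive than default; the name my-k3s below is just an example:
kubectl config rename-context default my-k3s
kubectl config use-context my-k3s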
Install Cilium (Replace Flannel)
When installing k3s, we disabled the default CNI (Flannel), so the nodes will stay NotReady and Pods cannot run until a replacement CNI is installed. Deploy Cilium as planned to provide networking and network policy capabilities.
Install Cilium using Helm:
# Add Cilium Helm repository
helm repo add cilium https://helm.cilium.io/
# Update Helm repository
helm repo update
# Install Cilium CNI (single replica, default mode)
helm install cilium cilium/cilium \
--namespace kube-system \
--set operator.replicas=1 \
--set ipam.mode=kubernetes
If kube-proxy is already disabled in the cluster (for example, k3s servers started with --disable-kube-proxy), you can additionally pass --set kubeProxyReplacement=true (older Cilium chart versions used the value strict). This tutorial keeps the default for broader compatibility.
Wait for Cilium components to start:
# Check Cilium-related Pod status
kubectl get pods -n kube-system -l k8s-app=cilium
# Check node status (should change from NotReady to Ready)
kubectl get nodes
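If you also have the Cilium CLI installed locally, it provides a more detailed health summary than kubectl alone (optional):
# Wait until the Cilium agent and operator report readiness
cilium status --wait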

After switching to Cilium, nodes should change from NotReady to Ready. Next, use k3sup to join the other two control nodes.
Scale Control Plane
To achieve high availability, we need at least three control nodes (an odd number, so the embedded etcd can maintain quorum). Use the following command to join the second control node:
k3sup join \
--ip <SECOND_NODE_IP> \
--user root \
--ssh-key <KEY_PATH> \
--server-ip <INITIAL_NODE_IP> \
--server \
--k3s-channel latest \
--k3s-extra-args "--disable traefik --disable servicelb --disable local-storage --flannel-backend=none --disable-network-policy"
Join the third control node in the same way:
k3sup join \
--ip <THIRD_NODE_IP> \
--user root \
--ssh-key <KEY_PATH> \
--server-ip <INITIAL_NODE_IP> \
--server \
--k3s-channel latest \
--k3s-extra-args "--disable traefik --disable servicelb --disable local-storage --flannel-backend=none --disable-network-policy"
Wait a few minutes and then check the cluster status:
kubectl get nodes
When all three control nodes are in Ready status, the control plane setup is complete.
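With the embedded etcd cluster in place, the output should look roughly like the following (illustrative only; names, ages, and versions will differ):
NAME     STATUS   ROLES                       AGE   VERSION
node-1   Ready    control-plane,etcd,master   30m   v1.xx.x+k3s1
node-2   Ready    control-plane,etcd,master   10m   v1.xx.x+k3s1
node-3   Ready    control-plane,etcd,master   5m    v1.xx.x+k3s1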

Install Nginx Ingress Controller
Replace k3s's default Traefik with Nginx Ingress Controller:
# Add Nginx Ingress Helm repository
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Install Nginx Ingress Controller (may take a while)
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.hostPort.enabled=true \
--set controller.hostPort.ports.http=80 \
--set controller.hostPort.ports.https=443 \
--set controller.service.type=ClusterIP
kubectl get pods -n ingress-nginx
kubectl get svc -n ingress-nginx
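To sanity-check the new Ingress controller, you can deploy a throwaway application and expose it through an Ingress. The manifest below is a minimal sketch; the whoami image, resource names, and the app.example.com host are placeholders to replace with your own:
# demo-whoami.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
        - name: whoami
          image: traefik/whoami:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: whoami
spec:
  selector:
    app: whoami
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: whoami
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com   # placeholder domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: whoami
                port:
                  number: 80
Apply it with kubectl apply -f demo-whoami.yaml. Because the controller binds ports 80/443 via hostPort, requests must reach the node where the ingress-nginx controller Pod is running, e.g. curl -H "Host: app.example.com" http://<CONTROLLER_NODE_IP>/ (or point your DNS/Cloudflare record at that node).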

Install Longhorn Distributed Storage
Longhorn is a cloud-native distributed block storage system developed by Rancher, providing enterprise-grade features such as high availability, backup, and snapshots. Compared to k3s's default local-path-provisioner, Longhorn supports persistent storage across nodes.
Pre-requisite Dependency Check
Before installing Longhorn, ensure that each node has the necessary dependencies installed. You can do this manually over SSH as shown below, or handle all nodes in bulk with the Ansible snippet further down:
# Execute on each node (via SSH)
# Check and install open-iscsi
apt update
apt install -y open-iscsi nfs-common
# Enable and start on boot
systemctl enable --now iscsid
systemctl status iscsid
If you want to use Ansible to install dependencies in bulk, refer to the following task snippet:
---
- name: Setup K3s nodes with Longhorn dependencies and CrowdSec
  hosts: k3s
  become: true
  vars:
    crowdsec_version: "latest"
  tasks:
    # ============================================
    # Longhorn Prerequisites
    # ============================================
    - name: Install Longhorn required packages
      ansible.builtin.apt:
        name:
          - open-iscsi  # iSCSI support for volume mounting
          - nfs-common  # NFS support for backup target
          - util-linux  # Provides nsenter and other utilities
          - curl        # For downloading and API calls
          - jq          # JSON processing for Longhorn CLI
        state: present
        update_cache: true
      tags: longhorn

    - name: Enable and start iscsid service
      ansible.builtin.systemd:
        name: iscsid
        enabled: true
        state: started
      tags: longhorn

    - name: Load iscsi_tcp kernel module
      community.general.modprobe:
        name: iscsi_tcp
        state: present
      tags: longhorn

    - name: Ensure iscsi_tcp loads on boot
      ansible.builtin.lineinfile:
        path: /etc/modules-load.d/iscsi.conf
        line: iscsi_tcp
        create: true
        mode: '0644'
      tags: longhorn

    - name: Check if multipathd is installed
      ansible.builtin.command: which multipathd
      register: multipathd_check
      failed_when: false
      changed_when: false
      tags: longhorn

    - name: Disable multipathd if installed (conflicts with Longhorn)
      ansible.builtin.systemd:
        name: multipathd
        enabled: false
        state: stopped
      when: multipathd_check.rc == 0
      tags: longhorn
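Assuming the playbook is saved as longhorn-deps.yml and your inventory file defines the k3s host group (both file names here are hypothetical), it can be run like this:
# inventory.ini lists the node IPs under a [k3s] group
ansible-playbook -i inventory.ini longhorn-deps.yml --tags longhorn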
Deploy Longhorn
Install Longhorn using Helm:
# Add Longhorn Helm repository
helm repo add longhorn https://charts.longhorn.io
helm repo update
# Install Longhorn (this may take a while).
# Note: we set a single replica here to save disk space; adjust it to your needs.
helm install longhorn longhorn/longhorn \
--namespace longhorn-system \
--create-namespace \
--set defaultSettings.defaultDataPath="/var/lib/longhorn" \
--set persistence.defaultClass=true \
--set persistence.defaultClassReplicaCount=1
Wait for Longhorn components to start:
# Check Longhorn component status
kubectl get pods -n longhorn-system
# Check StorageClass
kubectl get storageclass
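To confirm that the longhorn StorageClass works as the default, you can create a small test PVC and delete it afterwards (a minimal sketch; the name is arbitrary):
# longhorn-test-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
Apply it with kubectl apply -f longhorn-test-pvc.yaml and watch kubectl get pvc longhorn-test-pvc; with Longhorn's default Immediate binding mode it should reach Bound within a minute (with WaitForFirstConsumer it would stay Pending until a Pod mounts it). Clean up with kubectl delete pvc longhorn-test-pvc.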

Access Longhorn UI (Optional)
Longhorn provides a Web UI for managing storage volumes. You can temporarily access it through port forwarding:
# Port forward to local machine
kubectl port-forward -n longhorn-system svc/longhorn-frontend 8081:80
# Access http://localhost:8081 in your browser
After viewing, press Ctrl+C in the terminal to stop the port forwarding so it does not keep the local port occupied.
Join Agent Nodes
So far, we have built a 3-node high-availability control plane. In production environments, we don't want to run application workloads on the control-plane nodes, where they would compete for resources with core components such as the API Server and etcd.
Therefore, we join dedicated Agent nodes (also called Worker nodes) to run the workloads (Pods).
Install Pre-requisite Dependencies
Like control-plane nodes, Agent nodes also need to satisfy Longhorn's dependencies (if you want these nodes to store and mount persistent volumes).
Before joining, execute the following commands on all Agent nodes:
# Execute on each Agent node (via SSH)
apt update
apt install -y open-iscsi nfs-common
# Enable and start on boot
systemctl enable --now iscsid
Execute Join Command
The command for adding Agent nodes is almost identical to the one for control-plane nodes, with two key differences: the --server flag is omitted and no extra --k3s-extra-args are passed.
k3sup join \
--ip <AGENT_NODE_IP> \
--user root \
--ssh-key <KEY_PATH> \
--server-ip <ANY_CONTROL_NODE_IP> \
--k3s-channel latest
Verify Node Status
You can add as many Agent nodes as you need this way. After adding them, wait a few minutes for the Cilium and Longhorn components to be scheduled onto the new nodes automatically.
Use kubectl to check cluster status:
kubectl get nodes -o wide
You should see the newly joined nodes with <none> in the ROLES column (in k3s, <none> indicates an agent/worker node).
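If you prefer the ROLES column to display something more descriptive than <none>, you can label the agents yourself; the worker value below is just a convention, not something k3s requires:
kubectl label node <AGENT_NODE_NAME> node-role.kubernetes.io/worker=worker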
At the same time, you can monitor whether Cilium and Longhorn Pods successfully start on new nodes:
# Cilium agent should start on new nodes
kubectl get pods -n kube-system -o wide
# Longhorn instance-manager should also start on new nodes
kubectl get pods -n longhorn-system -o wide
At this point, the cluster has a highly available control plane plus dedicated worker nodes for running applications.
Next Steps
- Configure GitOps tools like Argo CD and Flux to standardize cluster declarative management processes.
- Deploy observability components like Prometheus, Loki, and Grafana to improve monitoring and logging systems.
- Use Ansible Playbooks to automate node initialization and component upgrades, reducing subsequent operational costs.